Libraries:
The Red Wine Quality data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
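Input variables (based on physicochemical tests): fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol.
Output variable (based on sensory data): quality (a score between 0 and 10).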
For this analysis, we are mainly looking to answer one question: Which chemical properties influence the quality of red wines?
To understand the Red Wine Quality dataset a little better, the first step is to look at a summary of the variables it contains. The summary shows how spread out the values are, through the minimum and maximum of each variable. It also gives a quick sense of the shape of each distribution, through the comparison of the mean and the median.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
| Variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| X | 1.0 | 400.5 | 800.0 | 800.0 | 1199.5 | 1599.0 |
| fixed.acidity | 4.60 | 7.10 | 7.90 | 8.32 | 9.20 | 15.90 |
| volatile.acidity | 0.1200 | 0.3900 | 0.5200 | 0.5278 | 0.6400 | 1.5800 |
| citric.acid | 0.000 | 0.090 | 0.260 | 0.271 | 0.420 | 1.000 |
| residual.sugar | 0.900 | 1.900 | 2.200 | 2.539 | 2.600 | 15.500 |
| chlorides | 0.01200 | 0.07000 | 0.07900 | 0.08747 | 0.09000 | 0.61100 |
| free.sulfur.dioxide | 1.00 | 7.00 | 14.00 | 15.87 | 21.00 | 72.00 |
| total.sulfur.dioxide | 6.00 | 22.00 | 38.00 | 46.47 | 62.00 | 289.00 |
| density | 0.9901 | 0.9956 | 0.9968 | 0.9967 | 0.9978 | 1.0037 |
| pH | 2.740 | 3.210 | 3.310 | 3.311 | 3.400 | 4.010 |
| sulphates | 0.3300 | 0.5500 | 0.6200 | 0.6581 | 0.7300 | 2.0000 |
| alcohol | 8.40 | 9.50 | 10.20 | 10.42 | 11.10 | 14.90 |
| quality | 3.000 | 5.000 | 6.000 | 5.636 | 6.000 | 8.000 |
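The output above has the shape of what `names()` and `summary()` print for a data frame. Here is a minimal sketch of that step, assuming the data frame is called `wines` (the name that appears in the correlation output of this report) and that it is read from a CSV file. The file name `wineQualityReds.csv` and the helper `spread_summary()` are assumptions made for illustration, not something the report shows.

```r
# Hypothetical helper: min, median, mean and max of every numeric column,
# i.e. the values the text uses to judge spread and shape.
spread_summary <- function(df) {
  numeric_cols <- df[vapply(df, is.numeric, logical(1))]
  stats <- vapply(
    numeric_cols,
    function(x) c(min = min(x), median = median(x), mean = mean(x), max = max(x)),
    numeric(4)
  )
  t(stats)  # one row per variable, one column per statistic
}

# Assumed file name; change it to wherever the data set is stored.
wines_path <- "wineQualityReds.csv"
if (file.exists(wines_path)) {
  wines <- read.csv(wines_path)
  print(names(wines))
  print(summary(wines))
  print(spread_summary(wines))
}
```

The guard around `read.csv()` only keeps the sketch from failing when the file is not present; in an actual analysis the file would simply be read.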
Another useful way to inspect the data is to visualize the distribution of each feature. This can be done with histograms, as shown below:
As mentioned above, analysing the distribution of the data gives a quick overview of our data and its boundaries.
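As a rough sketch of how such histograms can be drawn, the helper below plots one histogram per numeric column with base R graphics. The function name `plot_histograms()` and the default of 30 breaks are assumptions for illustration; the plots in this report may have been produced with a different plotting library and different bin widths.

```r
# Hypothetical helper: draw one histogram per numeric column of `df`
# using base graphics, and return the plotted column names invisibly.
plot_histograms <- function(df, breaks = 30) {
  numeric_names <- names(df)[vapply(df, is.numeric, logical(1))]
  n <- length(numeric_names)
  if (n == 0) return(invisible(character(0)))

  # Arrange the panels in a roughly square grid and restore settings on exit.
  n_cols <- ceiling(sqrt(n))
  old_par <- par(mfrow = c(ceiling(n / n_cols), n_cols))
  on.exit(par(old_par))

  for (name in numeric_names) {
    hist(df[[name]], breaks = breaks, main = name, xlab = name)
  }
  invisible(numeric_names)
}

# Usage, assuming `wines` holds the data set:
# plot_histograms(wines)
```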
We can classify each distribution as one of the following:
Normal distribution: the Gaussian (normal) distribution is a symmetric, bell-shaped curve, with measurements spread evenly above and below the mean value, so the mean and the median are approximately equal. (Source: https://bit.ly/2S9dYiD)
Right skewed or positively skewed distribution: in such a distribution, the mean is typically greater than the median, which in turn is greater than the mode (i.e., mean > median > mode), and the skewness is greater than zero. (Source: https://en.wikiversity.org/wiki/Skewness)
Left skewed or negatively skewed distribution: in such a distribution, the mean is typically lower than the median, which in turn is lower than the mode (i.e., mean < median < mode), and the skewness is lower than zero. (Source: https://en.wikiversity.org/wiki/Skewness)
Based on the definitions above and on the summary (mean and median), each wine property can be classified into one of these distributions.
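The mean-versus-median comparison described above can be sketched in a few lines. The means and medians are copied from the summary output of this report. The function `classify_shape()` and its 2% tolerance for "approximately symmetric" are assumptions chosen for illustration, so borderline variables may be labelled differently from a classification made by eye on the histograms.

```r
# Mean and median values copied from the summary output in this report.
summary_stats <- data.frame(
  variable = c("fixed.acidity", "volatile.acidity", "citric.acid",
               "residual.sugar", "chlorides", "free.sulfur.dioxide",
               "total.sulfur.dioxide", "density", "pH", "sulphates", "alcohol"),
  mean   = c(8.32, 0.5278, 0.271, 2.539, 0.08747, 15.87,
             46.47, 0.9967, 3.311, 0.6581, 10.42),
  median = c(7.90, 0.52, 0.26, 2.2, 0.079, 14,
             38, 0.9968, 3.31, 0.62, 10.2)
)

# Assumption: a relative gap between mean and median of at most `tol`
# counts as "approximately symmetric". The cut-off is arbitrary.
classify_shape <- function(mean_value, median_value, tol = 0.02) {
  gap <- (mean_value - median_value) / abs(median_value)
  ifelse(abs(gap) <= tol, "approximately symmetric",
         ifelse(gap > 0, "right skewed", "left skewed"))
}

summary_stats$shape <- classify_shape(summary_stats$mean, summary_stats$median)
print(summary_stats)
```

This is only a heuristic: the mean and median can be close in a distribution that is not bell-shaped, so the histograms remain the better guide.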
To get a quick look at the correlation between each pair of features, it’s possible to plot a matrix of plots and correlation values. This plot is shown below:
Since the matrix plot is a little hard to read and the correlation numbers are cut off, I decided to generate a scatter plot of wine quality against each feature, with the mean and median also marked in the plot. I also calculated the correlation between wine quality and each of the features.
Another comparison I decided to make was a boxplot of each feature for each wine quality score. To build this plot, I needed to add a new categorical variable named grade_number to the dataset. This helped me see the variation of the data within each wine quality score.
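The report does not show how `grade_number` was built. One plausible sketch, assuming it is simply `quality` converted to a factor, is given below together with a base R boxplot per quality grade. Both helper names are hypothetical.

```r
# Hypothetical helper: add `grade_number`, a categorical copy of `quality`,
# so each quality score becomes its own group in a boxplot.
add_grade_number <- function(df) {
  df$grade_number <- factor(df$quality, levels = sort(unique(df$quality)))
  df
}

# Hypothetical helper: boxplot of one feature for each quality grade.
boxplot_by_grade <- function(df, feature) {
  boxplot(df[[feature]] ~ df$grade_number,
          xlab = "quality (grade_number)", ylab = feature,
          main = paste(feature, "by quality"))
}

# Usage, assuming `wines` holds the data set:
# wines <- add_grade_number(wines)
# boxplot_by_grade(wines, "alcohol")
```

Treating quality as a factor keeps the boxes evenly spaced and makes it explicit that the scores are discrete grades rather than a continuous measurement.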
The plots and calculations are shown below:
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$fixed.acidity
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07548957 0.17202667
## sample estimates:
## cor
## 0.1240516
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$citric.acid
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$residual.sugar
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03531327 0.06271056
## sample estimates:
## cor
## 0.01373164
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$chlorides
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.17681041 -0.08039344
## sample estimates:
## cor
## -0.1289066
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$free.sulfur.dioxide
## t = -2.0269, df = 1597, p-value = 0.04283
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.099430290 -0.001638987
## sample estimates:
## cor
## -0.05065606
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$total.sulfur.dioxide
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2320162 -0.1373252
## sample estimates:
## cor
## -0.1851003
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$density
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2220365 -0.1269870
## sample estimates:
## cor
## -0.1749192
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$pH
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.106451268 -0.008734972
## sample estimates:
## cor
## -0.05773139
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
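To make the test output above easier to read, the sketch below recomputes the t statistic, the p-value and the 95% confidence interval for quality and alcohol from the reported correlation and degrees of freedom alone. It uses the standard formulas for Pearson's correlation test (a t statistic with n - 2 degrees of freedom and a confidence interval through the Fisher z-transformation); the variable names are chosen for illustration.

```r
# Values reported above for quality vs. alcohol.
r   <- 0.4761663   # sample correlation
dof <- 1597        # degrees of freedom, n - 2
n   <- dof + 2     # number of wines

# t statistic for the null hypothesis "true correlation is 0".
t_stat <- r * sqrt(dof) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), dof)

# 95% confidence interval through the Fisher z-transformation.
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(round(t_stat, 3))
print(round(ci, 4))
```

With 1,599 wines even small correlations come out as statistically significant, which is why the size of the correlation matters more here than the p-value.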
The results obtained from the correlation analysis, between each feature and the Quality, were:
| Feature | Correlation | Orientation | Strength |
|---|---|---|---|
| fixed.acidity | 0.1241 | Positive | Very Weak |
| volatile.acidity | -0.3906 | Negative | Weak |
| citric.acid | 0.2264 | Positive | Weak |
| residual.sugar | 0.0137 | Positive | Very Weak |
| chlorides | -0.1289 | Negative | Very Weak |
| free.sulfur.dioxide | -0.0507 | Negative | Very Weak |
| total.sulfur.dioxide | -0.1851 | Negative | Very Weak |
| density | -0.1749 | Negative | Very Weak |
| pH | -0.0577 | Negative | Very Weak |
| sulphates | 0.2514 | Positive | Weak |
| alcohol | 0.4762 | Positive | Medium |
Table interpretation:
- Strength (by absolute value of the correlation): Very Weak (0 to 0.2), Weak (0.21 to 0.4), Medium (0.41 to 0.6), Strong (0.61 to 0.8), Very Strong (0.81 to 1.0);
- Orientation: Positive or Negative.
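The labelling rule in the table interpretation can be written as a small function. The correlations below are rounded from the test output in this report. The thresholds are the convention used in this analysis, not a universal standard, and the names `quality_cor` and `strength_of()` are chosen for illustration.

```r
# Correlations with quality, rounded from the test output in this report.
quality_cor <- c(
  fixed.acidity = 0.1241, volatile.acidity = -0.3906, citric.acid = 0.2264,
  residual.sugar = 0.0137, chlorides = -0.1289, free.sulfur.dioxide = -0.0507,
  total.sulfur.dioxide = -0.1851, density = -0.1749, pH = -0.0577,
  sulphates = 0.2514, alcohol = 0.4762
)

# Map the absolute correlation to the strength labels used in this report.
strength_of <- function(r) {
  as.character(cut(
    abs(r),
    breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1),
    labels = c("Very Weak", "Weak", "Medium", "Strong", "Very Strong"),
    include.lowest = TRUE
  ))
}

result <- data.frame(
  feature = names(quality_cor),
  correlation = unname(quality_cor),
  orientation = ifelse(quality_cor > 0, "Positive", "Negative"),
  strength = strength_of(quality_cor),
  row.names = NULL
)

# Strongest relationships first.
print(result[order(-abs(result$correlation)), ])
```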
From the results we can highlight the correlations between Quality and Alcohol, Quality and Volatile Acidity, Quality and Sulphates, and Quality and Citric Acid. The other correlations have very small values, which indicates that those features don’t have a big impact on the Wine Quality result.
In this section, it’s necessary to generate more complex plots by adding color to the points. This adds a new layer and opens the path to the analysis of three variables, instead of only two as was done in the previous plots. For this analysis, we’ll use the variables with the largest absolute correlations with Wine Quality, which from the table above are alcohol, volatile.acidity, sulphates and citric.acid.
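As a sketch of the idea of encoding a third variable as color, the helper below draws a scatter plot of two features and colors each point by its quality score, using base R graphics. The function name `plot_by_quality()` and the "viridis" palette are assumptions for illustration (`hcl.colors()` needs R 3.6 or later); the plots in this report may have been produced differently.

```r
# Hypothetical helper: scatter plot of two features, colored by quality.
# Returns the color assigned to each quality score invisibly.
plot_by_quality <- function(df, x, y) {
  grades <- factor(df$quality)
  colours <- hcl.colors(nlevels(grades), palette = "viridis")
  plot(df[[x]], df[[y]],
       col = colours[as.integer(grades)], pch = 19,
       xlab = x, ylab = y,
       main = paste(x, "x", y, "by quality color"))
  legend("topright", legend = levels(grades), col = colours,
         pch = 19, title = "quality")
  invisible(setNames(colours, levels(grades)))
}

# Usage, assuming `wines` holds the data set:
# plot_by_quality(wines, "alcohol", "volatile.acidity")
```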
Plot: Alcohol x Volatile Acidity by Quality color
##
## Pearson's product-moment correlation
##
## data: wines$volatile.acidity and wines$alcohol
## t = -8.2546, df = 1597, p-value = 3.155e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2488416 -0.1548020
## sample estimates:
## cor
## -0.202288
##
## Pearson's product-moment correlation
##
## data: wines$sulphates and wines$alcohol
## t = 3.7568, df = 1597, p-value = 0.0001783
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.04477906 0.14196454
## sample estimates:
## cor
## 0.09359475
##
## Pearson's product-moment correlation
##
## data: wines$citric.acid and wines$alcohol
## t = 4.4188, df = 1597, p-value = 1.059e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.06121189 0.15807276
## sample estimates:
## cor
## 0.1099032
##
## Pearson's product-moment correlation
##
## data: wines$sulphates and wines$volatile.acidity
## t = -10.804, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3060917 -0.2147125
## sample estimates:
## cor
## -0.2609867
##
## Pearson's product-moment correlation
##
## data: wines$citric.acid and wines$volatile.acidity
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5856550 -0.5174902
## sample estimates:
## cor
## -0.5524957
##
## Pearson's product-moment correlation
##
## data: wines$citric.acid and wines$sulphates
## t = 13.159, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2678558 0.3563278
## sample estimates:
## cor
## 0.31277
The correlations between the analysed variables are:
| Correlation | alcohol | volatile.acidity | sulphates | citric.acid |
|---|---|---|---|---|
| alcohol | - | -0.2023 | 0.0936 | 0.1099 |
| volatile.acidity | -0.2023 | - | -0.2610 | -0.5525 |
| sulphates | 0.0936 | -0.2610 | - | 0.3128 |
| citric.acid | 0.1099 | -0.5525 | 0.3128 | - |
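The table above can be assembled into a symmetric correlation matrix and queried for the strongest pair. The six values below are rounded from the pairwise test output in this report; the variable names `vars`, `reported` and `cor_matrix` are chosen for illustration.

```r
vars <- c("alcohol", "volatile.acidity", "sulphates", "citric.acid")
cor_matrix <- diag(1, length(vars))
dimnames(cor_matrix) <- list(vars, vars)

# Pairwise correlations rounded from the test output in this report.
reported <- list(
  list("alcohol", "volatile.acidity", -0.2023),
  list("alcohol", "sulphates", 0.0936),
  list("alcohol", "citric.acid", 0.1099),
  list("volatile.acidity", "sulphates", -0.2610),
  list("volatile.acidity", "citric.acid", -0.5525),
  list("sulphates", "citric.acid", 0.3128)
)
for (p in reported) {
  cor_matrix[p[[1]], p[[2]]] <- p[[3]]
  cor_matrix[p[[2]], p[[1]]] <- p[[3]]
}

# Strongest relationship between two different variables.
off_diag <- cor_matrix
diag(off_diag) <- 0
strongest <- which(abs(off_diag) == max(abs(off_diag)), arr.ind = TRUE)[1, ]

print(cor_matrix)
print(vars[strongest])
```

With the data set loaded as `wines`, `cor(wines[, vars])` should give the same matrix in a single call, without the rounding.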
- alcohol and volatile.acidity: quality is higher for higher alcohol and decreases as volatile.acidity increases. Both wine quality and alcohol tend to decrease as volatile.acidity increases.
- alcohol and sulphates: sulphates seem to be lower for wines with less alcohol, and since alcohol had the strongest correlation with wine quality, wines with a lower concentration of sulphates generally have lower quality.
- alcohol and citric.acid: quality appears to increase with citric.acid, but to me it’s only a weak relation, while it’s clear that the increase in quality is more related to the alcohol concentration.
- volatile.acidity and sulphates: as volatile.acidity decreases, wine quality increases. Also, when the concentration of sulphates increases, volatile.acidity generally decreases.
- volatile.acidity and citric.acid: for higher citric.acid, we have a low concentration of volatile.acidity and a better wine quality. This also shows in the correlation obtained between citric.acid and volatile.acidity, -0.5525, which is a medium to strong negative correlation.
- sulphates and citric.acid: the relation with quality isn’t that strong, and I couldn’t see a clear combination that leads to a better wine quality.

This plot showed the negative correlation between volatile.acidity and wine quality.
This plot showed the correlation between alcohol and wine quality. It was important because alcohol is the feature with the largest correlation with quality.
This was the most unexpected result of the analysis. It turns out that the lower the volatile.acidity, the higher the quality and the citric.acid value.
The hardest part of this analysis was that there’s no clear relation between one or two features and the wine quality, which makes it hard to answer the question posed at the beginning of this project:
Which chemical properties influence the quality of red wines?
Even though this was a hard question to answer, it was possible to see that the feature most related to wine quality was alcohol. It has the largest correlation value, and the plots showed that, in general, the higher the alcohol, the better the wine is rated.
Also, it’s important to say that the variables volatile.acidity, sulphates and citric.acid have some degree of influence on the wine quality.
These four features were never in my thoughts as the most relevant to wine quality. I thought pH and residual sugar would be the ones that actually affected the perception of a wine’s quality. To me, this showed how important Exploratory Data Analysis is for a clear understanding of the data, and that guesses can be completely at odds with the information the data actually contains.